University of Michigan Biostatistics 699, Winter 2022

Notation

  • \(X\) is a set of predictors

  • \(Y\) is an outcome (could be continuous or discrete, including binary)

  • \(\hat Y (X)\) is a prediction of \(Y\) based on \(X\)

“Loss functions”: functions that measure model fit

smaller is better:

  • Squared error: \((Y-\hat Y (X))^2\)

  • Absolute error: \(|Y-\hat Y (X)|\)

  • Misclassification: \(1 - 1_{[Y=\hat Y(X)]}\)

  • Asymmetric versions of the above

  • Negative log-likelihood (if there is a statistical model): \(-2\log f(Y|\hat \theta)\)
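These losses are simple to compute. A pure-Python sketch (illustrative only, not course code), with a normal density standing in for the generic \(-2\log f(Y|\hat\theta)\):

```python
import math

def squared_error(y, y_hat):
    return (y - y_hat) ** 2

def absolute_error(y, y_hat):
    return abs(y - y_hat)

def misclassification(y, y_hat):
    # 1 - indicator[y == y_hat]: 0 for a correct label, 1 otherwise
    return 0 if y == y_hat else 1

def neg2_log_lik_normal(y, mu, sigma):
    # -2 log f(y | mu, sigma) for a normal density (illustrative choice)
    return -2 * (-0.5 * math.log(2 * math.pi * sigma ** 2)
                 - (y - mu) ** 2 / (2 * sigma ** 2))

print(squared_error(1.0, 0.5))      # 0.25
print(absolute_error(1.0, 0.5))     # 0.5
print(misclassification("a", "b"))  # 1
```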

Possible recipe for model selection

  1. Calculate the loss for each of your fitted models

  2. The model with the smallest loss is the ‘winner’

When the same data are used for both model fitting and model comparison, the most overfit model will tend to be selected

Replacing the negative log-likelihood with the AIC criterion counteracts this preference for overfit models

Compelling reasons to use AIC

  • Closed-form, one-step calculation

  • Applicable to any model in which number of estimated parameters can be properly counted

  • Estimates “in-sample” likelihood

  • Useful for non-nested comparison

  • Corrected version makes it useful even when \(n\) is small relative to number of parameters
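The one-step nature of AIC and its small-sample correction can be made concrete. A Python sketch using the standard formulas \(\mathrm{AIC} = 2k - 2\log L\) and \(\mathrm{AICc} = \mathrm{AIC} + 2k(k+1)/(n-k-1)\) (generic values, not from the notes):

```python
def aic(log_lik, k):
    # k = number of estimated parameters, log_lik = maximized log-likelihood
    return 2 * k - 2 * log_lik

def aicc(log_lik, k, n):
    # corrected AIC: useful when n is small relative to k
    return aic(log_lik, k) + 2 * k * (k + 1) / (n - k - 1)

# smaller is better: two hypothetical fits with the same likelihood
print(aic(-100.0, 3))       # 206.0
print(aicc(-100.0, 3, 30))  # about 206.92 (correction shrinks as n grows)
```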

Scenarios when AIC may not be useful

  • Likelihood does not exist or is not of interest

  • Cannot easily count model size

Cross-validation is a more general alternative

Ideal scenario for model selection: split samples

  • training data used for model fitting

  • Use these data to fit all models under consideration

  • validation data used for model comparison and selection

  • “estimating the performance of different models in order to choose the best one” (Hastie, Tibshirani, and Friedman 2009, p222)

  • testing data used for fitting selected model and conducting inference / prediction

  • “having chosen a final model, estimating its prediction error…on new data” (Hastie, Tibshirani, and Friedman 2009, p222)

See also Berk, Brown, and Zhao (2010)

Racing analogy

  • training: design and build car. Everyone gets to practice on same course

  • validation: all cars race on new course, and choose the winner

  • (inappropriate/wrong) testing: use the winning time from validation to predict how winner will continue to perform

  • (correct) testing: race the winning car by itself on yet another new course

Comparing / contrasting AIC and cross-validation

  • Both AIC and cross-validation allow you to use the same data for both training and validation steps

  • AIC does not extend to loss functions beyond likelihood

  • Cross-validation allows you to choose loss functions other than negative-log-likelihood

  • Neither solves the problem of post-selection inference, i.e. still need testing data

How does cross-validation work?

  1. Have \(n\) observations in data

  2. For \(j=1,\ldots\)

2.a Create random partition into \(K\) equally sized parts containing \(n/K\) observations. Each part is a “fold”

2.b For \(\ell=1,\ldots,K\)

  • Treat \(\ell\)th fold as validation data, remaining \(K-1\) folds collapsed together as training data. Fit candidate models

  • Numerically evaluate model, i.e. calculate log-likelihood, prediction error, misclassification rate, or other loss function, on \(\ell\)th fold

  3. Average loss across all values of \(\ell\) and \(j\), choose model with best value

  4. Refit selected model to full data
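The K-fold recipe above can be sketched in pure Python on hypothetical toy data, here comparing a mean-only model against a one-predictor least-squares fit under squared-error loss (one random partition, \(j=1\)):

```python
import random

random.seed(1)
n, K = 100, 5
x = [random.gauss(0, 1) for _ in range(n)]
y = [2 * xi + random.gauss(0, 1) for xi in x]   # strong linear signal

def ols_1d(xs, ys):
    # closed-form slope and intercept for simple linear regression
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = (sum((a - mx) * (c - my) for a, c in zip(xs, ys))
         / sum((a - mx) ** 2 for a in xs))
    return my - b * mx, b

idx = list(range(n))
random.shuffle(idx)                     # step 2.a: random partition
folds = [idx[i::K] for i in range(K)]   # K roughly equal folds

loss_mean, loss_ols = 0.0, 0.0
for l in range(K):                      # step 2.b: loop over folds
    held = set(folds[l])
    xtr = [x[i] for i in range(n) if i not in held]
    ytr = [y[i] for i in range(n) if i not in held]
    a, b = ols_1d(xtr, ytr)
    ybar = sum(ytr) / len(ytr)
    for i in folds[l]:                  # evaluate loss on held-out fold
        loss_mean += (y[i] - ybar) ** 2
        loss_ols += (y[i] - (a + b * x[i])) ** 2

print(loss_ols < loss_mean)             # True: the linear model wins
```

The winning model would then be refit to all \(n\) observations.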

cf. Section 7.10, Hastie, Tibshirani, and Friedman (2009)

How does cross-validation work? (alternative formulation)

  1. Have \(n\) observations in data

  2. For \(j=1,\ldots\)

2.a Fit candidate models to randomly selected X% of data (training data)

2.b Numerically evaluate models, i.e. calculate log-likelihood, prediction error, misclassification rate, or other loss function, on remaining (100-X)% of data (validation data)

  3. Average across all values of \(j\), choose model with best value

  4. Refit selected model to full data

Uses for CV

  • Any outcome-dependent decision made in model selection should be cross-validated

Ridge / Lasso / Elastic net

Forward / backward selection are limited

  • Resulting models tend to be overfit and unstable (to perturbations in data, changes to choice of significance threshold)

  • Backward selection not possible in \(p > n\) scenarios

  • Selecting variables may not be sufficient. Additionally / alternatively, may want to control variability of parameter estimates that make it into model

  • Ridge regression (Hoerl and Kennard 1970), LASSO (Tibshirani 1996), and Elastic Net (Zou and Hastie 2005)

Two perspectives

  • Ridge / Lasso / ENet as alternatives to maximum likelihood estimates (MLEs)

  • Lasso as an alternative to stepwise procedures for model selection

Maximum likelihood

Maximizing \(f(Y|\beta)\) gives the MLE: \[ \begin{aligned} \hat\beta_\mathrm{MLE} = \mathrm{argmax}_{\beta} f(Y|\beta) \end{aligned} \]

Ridge and Lasso constrain the likelihood

Limit total size of final estimates. Maximize \(f(Y|\beta)\) subject to \[ \begin{aligned} \mathrm{Ridge}: \mathrm{argmax}_{\beta} f(Y|\beta),\quad\mathrm{s.t.}\quad ||{\beta}||_2 \leq t\\ \mathrm{Lasso}: \mathrm{argmax}_{\beta} f(Y|\beta),\quad\mathrm{s.t.}\quad ||{\beta}||_1 \leq t \end{aligned} \] where \(||{\beta}||_2 = \sqrt{\sum_{j=1}^p \beta_j^2}\) and \(||{\beta}||_1 = \sum_{j=1}^p |\beta_j|\) and \(t\) is a “tuning parameter” (an auxiliary parameter that cannot be directly estimated from the data)
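For reference, the two norms in the constraints written out in plain Python (illustrative only):

```python
def l2_norm(beta):
    # ||beta||_2 = sqrt(sum of squared coefficients)
    return sum(b * b for b in beta) ** 0.5

def l1_norm(beta):
    # ||beta||_1 = sum of absolute coefficients
    return sum(abs(b) for b in beta)

beta = [3.0, -4.0, 0.0]
print(l2_norm(beta))  # 5.0
print(l1_norm(beta))  # 7.0
```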

What do these constraints look like?

Example 1: MLEs are weakly correlated

Smaller polygons correspond to smaller \(t\)s

Example 2: MLEs are highly correlated

Smaller polygons correspond to smaller \(t\)s

Constrained versus penalized MLE

  • The constraint idea is intuitively satisfying, but it would be helpful to characterize it as an unconstrained problem

  • Using Lagrangian ideas, it can be shown \[ \begin{aligned} &\mathrm{argmax}_{\beta} f(Y|\beta),\quad\mathrm{s.t.}\quad ||\beta||_2 \leq t\\ & = \mathrm{argmax}_{\beta} f(Y|\beta),\quad\mathrm{s.t.}\quad ||\beta||_2^2 \leq t^2\\ & = \mathrm{argmax}_{\beta} \left\{f(Y|\beta) - \lambda\left(||\beta||_2^2 - t^2\right)\right\}\\ & = \mathrm{argmax}_{\beta} \left\{f(Y|\beta) - \lambda||\beta||_2^2 \right\} \end{aligned} \]

  • Now \(\lambda\) is the tuning parameter, sharing a one-to-one correspondence with \(t\) (large \(t\) \(\Rightarrow\) small \(\lambda\))

  • For fixed \(\lambda\), maximize \(f(Y|\beta) - \lambda||\beta||_2^2\)
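In one dimension the penalized form is easy to solve explicitly: for a single no-intercept predictor, minimizing \(\sum_i(y_i - \beta x_i)^2 + \lambda\beta^2\) (the least-squares analogue of the penalized likelihood above) gives \(\hat\beta = S_{xy}/(S_{xx}+\lambda)\), so larger \(\lambda\) (smaller \(t\)) shrinks the estimate toward zero. A sketch with made-up data:

```python
def ridge_1d(x, y, lam):
    # closed-form 1-D ridge slope: S_xy / (S_xx + lam)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.8]
print(ridge_1d(x, y, 0.0))   # the OLS/MLE slope (~0.99)
print(ridge_1d(x, y, 10.0))  # shrunk toward zero
```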

Good and Gaskins (1971)

Mean squared error (MSE) as measure of efficiency

\[ \begin{aligned} \mathrm{MSE}(\hat\beta)&=E\left[\left(\hat\beta-\beta\right)^\top\left(\hat\beta-\beta\right)\right]\\ &=E\left(\hat\beta^\top\hat\beta\right) - 2E\left(\hat\beta\right)^\top\beta + \beta^\top\beta\\ &=E\left(\hat\beta^\top\hat\beta\right) - E\left(\hat\beta\right)^\top E\left(\hat\beta\right)\\ &\quad + E\left(\hat\beta\right)^\top E\left(\hat\beta\right) -2E\left(\hat\beta\right)^\top\beta + \beta^\top\beta\\ &= \sum_{j=1}^p \mathrm{Var}(\hat\beta_j) + \left[E\left(\hat\beta\right)-\beta\right]^\top\left[E\left(\hat\beta\right)-\beta\right]\\ &= \sum_{j=1}^p \mathrm{Var}(\hat\beta_j) + \sum_{j=1}^p \mathrm{Bias}(\hat\beta_j)^2\\ &= \mathrm{Trace} \mathrm{Var}(\hat\beta) + \mathrm{Bias}(\hat\beta)^\top\mathrm{Bias}(\hat\beta) \end{aligned} \]
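The decomposition above can be checked by Monte Carlo. A hypothetical scalar setup: estimate a mean \(\mu\) with the shrunken estimator \(c\,\bar y\), and compare simulated MSE against the analytic \(\mathrm{Var} + \mathrm{Bias}^2\):

```python
import random

random.seed(2)
mu, sigma, n, c = 2.0, 1.0, 25, 0.8   # assumed values for illustration
reps = 20000

ests = []
for _ in range(reps):
    ybar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    ests.append(c * ybar)             # biased, lower-variance estimator

mse_sim = sum((e - mu) ** 2 for e in ests) / reps
var_true = c ** 2 * sigma ** 2 / n    # Var(c * ybar)
bias2_true = ((c - 1) * mu) ** 2      # Bias(c * ybar)^2
print(mse_sim, var_true + bias2_true) # the two should nearly agree
```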

MSE of MLE

Ignoring the finite-sample bias, \[ \begin{aligned} \mathrm{MSE}\left(\hat\beta_\mathrm{MLE}\right) \approx \sigma^2 \mathrm{Trace}\left[\left( X^\top X\right)^{-1}\right] = \sigma^2 \sum_{j=1}^p \dfrac{1}{e_j} \end{aligned} \]

where \(\sigma^2\) is the error variance and \(e_j\) is the \(j\)th eigenvalue of \(X^\top X\)

MLE from an efficiency perspective

  • If focus is on (asymptotically) unbiased estimators, then \(\hat\beta_\mathrm{MLE}\) is usually preferred

  • But its variance, and therefore MSE, may be large if eigenvalues are small (one symptom of collinearity of predictors and/or high-dimension of predictor vector)

  • Can we decrease MSE by allowing for some bias?

Motivation for ridge: decreasing MSE

The ridge estimator in linear regression is available in closed form (unlike the Lasso). For a normal likelihood,

\[ \begin{aligned} \hat\beta_\mathrm{R}(\lambda) \equiv \mathrm{argmax}_{\beta} \left\{f(Y|\beta) - \lambda||\beta||_2^2 \right\} = \left( X^\top X + \lambda I_p\right)^{-1} X^\top Y \end{aligned} \]

\[ \begin{aligned} \mathrm{MSE}\left(\hat\beta_\mathrm{R}(\lambda)\right) &= \mathrm{Variance} + \mathrm{Bias}^2 \\ & = \sigma^2 \sum_{j=1}^p \dfrac{e_j}{(e_j+\lambda)^2} + \lambda^2 \sum_{j=1}^p\dfrac{\alpha_j^2}{(e_j+\lambda)^2}, \end{aligned} \] where \(\alpha = Q \beta\), with \(Q\) the matrix of eigenvectors of \(X^\top X\)

MSE comparison

  • Since \(\frac{e_j}{(e_j+\lambda)^2} < \frac{1}{e_j}\) for \(\lambda > 0\), the ridge estimator (\(\hat\beta_\mathrm{R}(\lambda)\)) will always have lower variance than the MLE (\(\hat\beta_\mathrm{MLE}\))

  • However, this decrease in variance should not be at the expense of introducing too much bias (the \(\dfrac{\alpha_j^2}{(e_j+\lambda)^2}\) term)

  • This is called “bias-variance tradeoff”

Bias-variance tradeoff visualized

Hoerl and Kennard (1970) showed that there is always a \(\lambda > 0\) that decreases the MSE
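The Hoerl-Kennard point is easy to see in one dimension. With a single eigenvalue \(e = \sum x_i^2\), the MSE formula above reduces to \(\sigma^2 e/(e+\lambda)^2 + \lambda^2\beta^2/(e+\lambda)^2\), which at \(\lambda = 0\) is the MLE's variance \(\sigma^2/e\); a small positive \(\lambda\) lowers it. A sketch with assumed values:

```python
def ridge_mse_1d(lam, e, beta, sigma2):
    # variance + squared bias for the 1-D ridge estimator
    return (sigma2 * e + lam ** 2 * beta ** 2) / (e + lam) ** 2

e, beta, sigma2 = 10.0, 0.5, 1.0       # assumed values for illustration
mse_mle = ridge_mse_1d(0.0, e, beta, sigma2)   # sigma2 / e = 0.1
mse_ridge = ridge_mse_1d(2.0, e, beta, sigma2)
print(mse_mle, mse_ridge)              # ridge MSE is smaller here
```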

Lasso

  • Lasso does not have a closed-form solution

  • The Lasso penalty conducts both variable selection and shrinkage

  • Ridge never sets estimates exactly to zero, but it does shrink them; it tends to outperform the Lasso when predictors are highly correlated
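One way to see the selection-versus-shrinkage contrast: in the special case of an orthonormal design (an assumption made purely for illustration), the lasso solution is soft-thresholding of the MLE, producing exact zeros, while ridge only rescales it:

```python
def soft_threshold(b_mle, lam):
    # lasso in the orthonormal case: sign(b) * max(|b| - lam, 0)
    if b_mle > lam:
        return b_mle - lam
    if b_mle < -lam:
        return b_mle + lam
    return 0.0

def ridge_shrink(b_mle, lam):
    # ridge in the orthonormal case: proportional shrinkage, never exactly zero
    return b_mle / (1 + lam)

print(soft_threshold(0.3, 0.5))  # 0.0 -> variable dropped
print(soft_threshold(2.0, 0.5))  # 1.5 -> kept, shrunk by lam
print(ridge_shrink(0.3, 0.5))    # about 0.2 -> shrunk but kept
```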

Back to cross-validation

  1. Have \(n\) observations in data and grid of tuning parameters \(\{\lambda_1 \approx 0, \lambda_2, \ldots, \lambda_M = \infty\}\)

  2. For \(j=1,\ldots\)

2.a Create random partition into \(K\) equally sized parts containing \(n/K\) observations. Each part is a fold

2.b For \(\ell=1,\ldots,K\)

  • Treat \(\ell\)th fold as validation data, remaining \(K-1\) folds collapsed together as training data. Fit the model at each value of \(\lambda_m\)

  • Numerically evaluate model using each \(\lambda_m\), i.e. calculate log-likelihood, prediction error, misclassification rate, or other loss function, on \(\ell\)th fold

  3. Average across all values of \(\ell\) and \(j\), choose model (\(\lambda_m\)) with best value

  4. Refit selected model (\(\lambda_m\)) to full data
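Tying the recipe together, a pure-Python sketch (toy data and a made-up \(\lambda\) grid, not course code) that picks the 1-D ridge penalty by 5-fold cross-validation on squared error, then refits on the full data:

```python
import random

random.seed(3)
n, K = 60, 5
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]

def ridge_1d(xs, ys, lam):
    # closed-form 1-D ridge slope (no intercept)
    return sum(a * b for a, b in zip(xs, ys)) / (sum(a * a for a in xs) + lam)

idx = list(range(n))
random.shuffle(idx)
folds = [idx[i::K] for i in range(K)]
grid = [0.0, 1.0, 5.0, 25.0, 125.0]    # grid of tuning parameters

cv_loss = {}
for lam in grid:
    total = 0.0
    for l in range(K):
        held = set(folds[l])
        xtr = [x[i] for i in range(n) if i not in held]
        ytr = [y[i] for i in range(n) if i not in held]
        b = ridge_1d(xtr, ytr, lam)    # fit on training folds
        total += sum((y[i] - b * x[i]) ** 2 for i in folds[l])
    cv_loss[lam] = total / n           # average held-out loss

best_lam = min(cv_loss, key=cv_loss.get)
b_final = ridge_1d(x, y, best_lam)     # refit selected model to full data
print(best_lam)
```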

R package glmnet

  • Implements ridge and lasso penalties, and continuous mixture of the two (elastic net (Zou and Hastie 2005))

\[ \begin{aligned} \mathrm{argmax}_{\beta} \left\{f(Y|\beta) - \lambda\left[(1-\alpha)/2\,||\beta||_2^2 + \alpha ||\beta||_1\right]\right\} \end{aligned} \]

  • Implements linear, logistic, and Poisson likelihoods and Cox partial likelihoods

  • Use glmnet when you have your specific tuning parameters in mind

  • Use cv.glmnet to use cross-validation to select your tuning parameters (you nearly always need to use cv.glmnet)

  • cv.glmnet auto-selects grid of \(\lambda\) values; uses cross-validation to pick \(\lambda\)

  • If you also want to select \(\alpha\) via cross-validation, you must do so yourself

  • Use glmnetUtils for “Some quality-of-life functions to streamline the process of fitting elastic net models with the glmnet package”

Very helpful vignettes

args(glmnet::cv.glmnet);
## function (x, y, weights = NULL, offset = NULL, lambda = NULL, 
##     type.measure = c("default", "mse", "deviance", "class", "auc", 
##         "mae", "C"), nfolds = 10, foldid = NULL, alignment = c("lambda", 
##         "fraction"), grouped = TRUE, keep = FALSE, parallel = FALSE, 
##     gamma = c(0, 0.25, 0.5, 0.75, 1), relax = FALSE, trace.it = 0, 
##     ...) 
## NULL
  • Model training always done via maximum penalized likelihood

  • You can choose the loss function for model selection (not all options apply to all outcome types)

glmnet example 1

library(glmnet);
set.seed(100);
n = 200;#number obs
p = 1e3;#number covariates
rho = 0.025;#compound symmetric correlation
true_beta = c(0.3, 0.3, numeric(p - 2));
chol_x = chol(matrix(rho, nrow = p, ncol = p) + 
                diag(1 - rho, nrow = p));
sigma_y = sqrt(0.20);
x = matrix(rnorm(n * p), nrow = n) %*% chol_x;
y = x%*%true_beta + rnorm(n, sd = sigma_y);
  • Two non-null regression coefficients; 998 null
  • Small-ish correlation between predictors
  • The proportion of variance explained by the regression is \(R^2 = \mathrm{Var}(X\beta) / (\mathrm{Var}(X\beta) + \sigma^2) = 1/ \left(1+ \tfrac{\sigma^2}{\mathrm{Var}(X\beta)}\right)\):
1 / (1 + sigma_y^2 / drop(crossprod(chol_x%*%true_beta)))
## [1] 0.48

glmnet example 1: ridge

plot(glmnet(x, y, alpha = 0), xvar = "lambda")

glmnet example 1: ridge

ex1_ridge_cv = cv.glmnet(x, y, nfolds = 5, alpha = 0);
plot(ex1_ridge_cv);

glmnet example 1: ridge

library(broom);   # provides tidy()
library(dplyr);   # provides %>%, filter(), bind_cols()
library(ggplot2);

ex1_ridge <- 
  glmnet(x, y, lambda = ex1_ridge_cv$lambda.1se, alpha = 0) %>% 
  tidy() %>%
  filter(term != "(Intercept)") %>%
  bind_cols(truth = true_beta, 
            method = "ridge", 
            partition = 1)

ggplot(ex1_ridge) + 
  geom_point(aes(x = truth,  
                 y = estimate)) + 
  geom_abline(intercept = 0, slope = 1) + 
  theme(text = element_text(size = 24))

glmnet example 1: lasso

plot(glmnet(x, y, alpha = 1), xvar = "lambda")

glmnet example 1: lasso

ex1_lasso_cv = cv.glmnet(x, y, nfolds = 5, alpha = 1);
plot(ex1_lasso_cv);

glmnet example 1: lasso

ex1_lasso <- 
  glmnet(x, y, lambda = ex1_lasso_cv$lambda.1se, alpha = 1) %>%
  tidy(return_zeros = TRUE) %>%
  filter(term != "(Intercept)") %>%
  bind_cols(truth = true_beta, 
            method = "lasso", 
            partition = 1)

ggplot(ex1_lasso) + 
  geom_point(aes(x = truth,  
                 y = estimate)) + 
  geom_abline(intercept = 0, slope = 1) + 
  theme(text = element_text(size = 24))

glmnet example 1: elastic net

lasso_fraction = 0.5;
plot(glmnet(x, y, alpha = lasso_fraction), xvar = "lambda")

glmnet example 1: elastic net

ex1_enet_cv = cv.glmnet(x, y, nfolds = 5, alpha = lasso_fraction);
plot(ex1_enet_cv);

glmnet example 1: elastic net

ex1_enet <- 
  glmnet(x, y, lambda = ex1_enet_cv$lambda.1se, alpha = lasso_fraction) %>% 
  tidy(return_zeros = TRUE) %>%
  filter(term != "(Intercept)") %>%
  bind_cols(truth = true_beta, 
            method = "enet", 
            partition = 1)

ggplot(ex1_enet) + 
  geom_point(aes(x = truth,  
                 y = estimate)) + 
  geom_abline(intercept = 0, slope = 1) + 
  theme(text = element_text(size = 24))

Example 1: comparison across methods and 2 partitions

RMSE \(= \sqrt{\sum_{i=1}^p (\beta_i-\hat\beta_i)^2/p}\)

| method | partition | RMSE (×10000) |
|--------|-----------|---------------|
| ridge  | 1         | 133.9         |
| ridge  | 2         | 133.9         |
| lasso  | 1         | 68.0          |
| lasso  | 2         | 68.0          |
| enet   | 1         | 58.0          |
| enet   | 2         | 65.1          |

glmnet example 2

set.seed(100);
n = 200;#number obs
p = 1e3;#number covariates
rho = 0.4;#compound symmetric correlation
true_beta = 0.6/p + numeric(p);
chol_x = chol(matrix(rho, nrow = p, ncol = p) + 
                diag(1 - rho, nrow = p));
sigma_y = sqrt(0.20);
x = matrix(rnorm(n * p), nrow = n) %*% chol_x;
y = x%*%true_beta + rnorm(n, sd = sigma_y);
  • 1000 (tiny) non-null regression coefficients
  • Substantial correlation between predictors
  • The proportion of variance explained by the regression is \(R^2 = \mathrm{Var}(X\beta) / (\mathrm{Var}(X\beta) + \sigma^2) = 1/ \left(1+ \tfrac{\sigma^2}{\mathrm{Var}(X\beta)}\right)\):
1 / (1 + sigma_y^2 / drop(crossprod(chol_x%*%true_beta)))
## [1] 0.419

glmnet example 2: ridge

glmnet example 2: ridge

glmnet example 2: lasso

glmnet example 2: lasso

glmnet example 2: elastic net

glmnet example 2: elastic net

Example 2: comparison across methods and 2 partitions

RMSE \(= \sqrt{\sum_{i=1}^p (\beta_i-\hat\beta_i)^2/p}\)

| method | partition | RMSE (×10000) |
|--------|-----------|---------------|
| ridge  | 1         | 2.64          |
| ridge  | 2         | 3.06          |
| lasso  | 1         | 28.79         |
| lasso  | 2         | 27.98         |
| enet   | 1         | 27.89         |
| enet   | 2         | 27.89         |

Lasso as variable selection

Can also use Lasso as alternative to stepwise procedures:

  1. Identify best Lasso model using above steps
  2. Final model includes only selected covariates but fit using MLE

Example 2: comparison of MLE on Lasso-selected variables against Lasso

Comparison of three models fit to breast DX data

standardized log-OR \(\equiv \beta_i / \mathrm{sd}(X_i)\)

Full model Most common (bootstrap) Forward selected Smallest AIC (bootstrap) Ridge Enet Lasso
area_worst 10.69 11.10 11.57 8.22 0.63 1.05
compactness_worst -1.31 -1.47 0.17
concavepoints_worst 2.47 2.90 2.53 2.42 0.59 0.97 1.31
concavity_worst 0.96 1.06 0.33 0.31 0.19
fractal_dimension_worst -0.14 -0.03
perimeter_worst 0.44 -2.92 -3.19 0.68 1.11
radius_worst -2.59 0.71 1.28 3.55
smoothness_worst 1.23 1.06 1.05 1.20 0.31 0.47 0.44
symmetry_worst 0.60 0.43 0.58 0.23 0.24 0.18
texture_worst 1.74 1.73 1.72 1.75 0.49 0.84 0.93

Some loose ends

Could we have used AIC here? For linear ridge regression, we have

\[ \begin{aligned} p_\mathrm{eff}(\lambda) = \mathrm{Trace}\left[ X \left( X^\top X + n\lambda I_p\right)^{-1} X^\top\right] < \mathrm{Trace}\left[ X \left( X^\top X\right)^{-1} X^\top\right] = p \end{aligned} \] So sometimes yes, but this is not generally possible for non-linear estimators or for non-ridge penalties

Some loose ends

Subset selection can be written as penalized likelihood problem:

\[ \begin{aligned} \hat\beta_\mathrm{BSS}(\lambda) \equiv \mathrm{argmax}_{\beta} \left\{f(Y|\beta) - \lambda||\beta||_0 \right\}, \\ ||\beta||_0 = \sum_{j=1}^p 1_{[\beta_j\neq 0]} \end{aligned} \]

  • Cost of \(\lambda\) for each non-zero coefficient estimate, but no additional penalty for the magnitude of non-zero coefficient estimates

  • \(2^p\) models to search over
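The \(2^p\) blow-up in the bullet above, made concrete (illustration only): enumerating every candidate subset of \(p\) predictors.

```python
from itertools import combinations

def all_subsets(p):
    # every subset of predictor indices {0, ..., p-1}, including the empty model
    preds = range(p)
    return [s for k in range(p + 1) for s in combinations(preds, k)]

print(len(all_subsets(10)))  # 2^10 = 1024 candidate models
```

Even at \(p = 30\) this is over a billion models, which is why penalized approaches like the Lasso are attractive substitutes.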

Closing thoughts

  • Focus here was on model selection / variable selection

  • Less focus on inference, which is difficult to do after model selection

  • Re-sampling ideas for inference also exist

References

Berk, Richard, Lawrence Brown, and Linda Zhao. 2010. “Statistical Inference After Model Selection.” Journal of Quantitative Criminology 26 (2): 217–36.

Good, I. J., and R. A. Gaskins. 1971. “Nonparametric Roughness Penalties for Probability Densities.” Biometrika 58 (2): 255–77.

Hastie, T, R Tibshirani, and J Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York, NY: Springer.

Hoerl, Arthur E., and Robert W. Kennard. 1970. “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics 12: 55–67.

Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B 58: 267–88.

Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society, Series B 67: 301–20.